Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho
Authors
Abstract
Tokenisation is an important first pre-processing step required to adequately test finite-state morphological analysers. In agglutinative languages each morpheme is concatenatively added on to form a complete morphological structure. Disjunctive agglutinative languages like Northern Sotho write these morphemes, for certain morphological categories only, as separate words separated by spaces or line breaks. These breaks are, by their nature, different from the breaks that separate "words" that are written conjunctively. A tokeniser is required to isolate categories, like a verb, from raw text before they can be correctly morphologically analysed. The authors have successfully produced a finite-state tokeniser for Northern Sotho, where verb segments are written disjunctively but nominal segments conjunctively. The authors show that, since reduplication in the Northern Sotho language does not affect the pre-processing tokeniser, the disjunctive standard verbal segment as a construct in Northern Sotho is deterministic, finite-state and a regular (Type 3) language in the Chomsky hierarchy, and that the copulative verbal segment, due to its semi-disjunctivism, is ambiguously non-deterministic.
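As a rough illustration of the idea in the abstract, the following minimal sketch groups disjunctively written verbal prefixes with the word that follows them, using a regular expression (and hence something a finite-state automaton can recognise). It is not the authors' tokeniser: the prefix inventory `PREFIXES`, the function name `tokenise` and the sample phrases are simplified assumptions chosen for illustration, and the sketch ignores the ambiguity of the copulative verbal segment.

```python
import re

# Hypothetical, heavily simplified inventory of disjunctively written
# verbal prefixes (a few subject concords and a present-tense marker).
# A real tokeniser would need the full morpheme classes of the language.
PREFIXES = ("ke", "o", "re", "ba", "a")

_prefix = "(?:" + "|".join(PREFIXES) + ")"

# A verbal segment is one or more prefixes, each followed by whitespace
# (a space or a line break), then one final word. Anything else is
# emitted as a single conjunctively written word.
TOKEN = re.compile(rf"\b(?:{_prefix}\s+)+\w+|\w+")


def tokenise(text):
    """Return tokens, keeping each disjunctive verbal segment together."""
    tokens = []
    for match in TOKEN.finditer(text.lower()):
        # Normalise internal spaces and line breaks to a single space.
        tokens.append(re.sub(r"\s+", " ", match.group(0)))
    return tokens


if __name__ == "__main__":
    print(tokenise("monna o reka dijo"))
```

Because the pattern is purely regular, the same grouping could be expressed in a finite-state toolkit. The greedy prefix match is a design shortcut that would mis-tokenise any word form that is homographic with a prefix.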
Similar resources
Morphosyntactic discrepancies in representing the adjective equivalent in African WordNet with reference to Northern Sotho
This paper aims to highlight morphosyntactic discrepancies encountered in representing the adjective equivalent in African WordNet, with reference to Northern Sotho. Northern Sotho is an agglutinating language with rich and productive morphology. The language also features a disjunctive orthographic system. The orthography determines the attachment selection of morphemes. The immediate issue, i...
Realisations of a single high tone in Northern Sotho
This article reports on a production study that investigates the realisation of a single high tone in the verbal constituent in Northern Sotho, a Bantu language spoken in South Africa. The parameters of variation investigated are based on existing descriptive and theoretical literature and relate to numbers of syllables in the verb stem, morphosyntactic constituency and verb-internal morphologi...
Grammar-based tools for the creation of tagging resources for an unresourced language: the case of Northern Sotho
We describe an architecture for the parallel construction of a tagger lexicon and an annotated reference corpus for the part-of-speech tagging of Northern Sotho, a Bantu language of South Africa, for which no tagged resources have been available so far. Our tools make use of grammatical properties (morphological and syntactic) of the language. We use symbolic pretagging, followed by stochastic t...
Identifying phonological processing deficits in Northern Sotho-speaking children: The use of non-word repetition as a language assessment tool in the South African context
Diagnostic testing of speech/language skills in the African languages spoken in South Africa is a challenging task, as standardised language tests in the official languages of South Africa barely exist. Commercially available language tests are in English, and have been standardised in other parts of the world. Such tests are often translated into African languages, a practice that speech langu...
The Production of Nominal and Verbal Inflection in an Agglutinative Language: Evidence from Hungarian
The contrast between regular and irregular inflectional morphology has been useful in investigating the functional and neural architecture of language. However, most studies have examined the regular/irregular distinction in non-agglutinative Indo-European languages (primarily English) with relatively simple morphology. Additionally, the majority of research has focused on verbal rather than no...
Journal title:
Volume, issue
Pages -
Publication date: 2006